Abstract feature extraction for text classification

Authors

  • Göksel BİRİCİK
  • Banu DİRİ
  • Ahmet Coşkun SÖNMEZ
Abstract

Göksel BİRİCİK∗, Banu DİRİ, Ahmet Coşkun SÖNMEZ
Department of Computer Engineering, Yıldız Technical University, Esenler, İstanbul-TURKEY
e-mails: {goksel,banu,acsonmez}@ce.yildiz.edu.tr
Received: 03.02.2011

Feature selection and extraction are frequently used solutions to overcome the curse of dimensionality in text classification problems. We introduce an extraction method that summarizes the features of the document samples, where the new features aggregate information about how much evidence there is in a document for each class. To form these abstract features, we project the high-dimensional features of documents onto a new feature space whose number of dimensions equals the number of classes. We test our method on 7 different text classification algorithms with different classifier design approaches. We examine the performance of the classifiers on standard text categorization test collections and show the enhancements achieved by applying our extraction method. We compare the classification performance of our method with popular and well-known feature selection and feature extraction schemes. Results show that our summarizing abstract feature extraction method enhances classification performance on most of the classifiers when compared with the other methods.
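The abstract does not state the weighting formula, so the following is only an illustrative sketch of the general idea (projecting term features onto one dimension per class), not the authors' exact method. The function names `fit_term_class_weights` and `project`, and the choice of normalized per-class term counts as the "evidence" weights, are assumptions made for this example.

```python
def fit_term_class_weights(docs, labels, n_classes):
    """Estimate how strongly each term points to each class.

    docs: list of term-count vectors (one list per training document).
    labels: integer class label (0..n_classes-1) for each document.
    Returns an n_terms x n_classes table; each non-empty row sums to 1.
    ASSUMPTION: evidence = share of a term's total count seen in each class.
    """
    n_terms = len(docs[0])
    weights = [[0.0] * n_classes for _ in range(n_terms)]
    for doc, label in zip(docs, labels):
        for t, count in enumerate(doc):
            weights[t][label] += count
    for row in weights:
        total = sum(row)
        if total > 0:
            for c in range(n_classes):
                row[c] /= total
    return weights


def project(doc, weights):
    """Map one term-count vector to n_classes abstract features.

    Each output value aggregates the evidence the document's terms
    give for one class, scaled so the features sum to 1.
    """
    n_classes = len(weights[0])
    scores = [0.0] * n_classes
    for t, count in enumerate(doc):
        for c in range(n_classes):
            scores[c] += count * weights[t][c]
    total = sum(scores)
    return [s / total for s in scores] if total > 0 else scores


# Toy corpus: 3 terms, 2 classes.
train_docs = [[2, 0, 1], [1, 0, 0], [0, 3, 1]]
train_labels = [0, 0, 1]
weights = fit_term_class_weights(train_docs, train_labels, n_classes=2)
features = project([0, 2, 2], weights)  # 3 term features -> 2 class features
```

In this sketch a document with any number of term features is reduced to as many features as there are classes, and the reduced vector can then be passed to whichever classifier is being evaluated.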


Similar resources

An Improved K-Nearest Neighbor with Crow Search Algorithm for Feature Selection in Text Documents Classification

The Internet provides easy access to a kind of library resources. However, classification of documents from a large amount of data is still an issue and demands time and energy to find certain documents. Classification of similar documents in specific classes of data can reduce the time for searching the required data, particularly text documents. This is further facilitated by using Artificial...


A New Approach for Text Documents Classification with Invasive Weed Optimization and Naive Bayes Classifier

With the rapid growth in the number of documents, Text Document Classification (TDC) methods have become crucial. This paper presents a hybrid model of Invasive Weed Optimization (IWO) and the Naive Bayes (NB) classifier (IWO-NB) for Feature Selection (FS) in order to reduce the large size of the feature space in TDC. TDC includes different actions such as text processing, feature extraction, form...


Document Analysis And Classification Based On Passing Window

In this paper we present a document analysis and classification system to segment and classify the contents of Arabic document images. This system includes preprocessing, document segmentation, feature extraction and document classification. A document image is enhanced in preprocessing by removing noise, binarizing, and detecting and correcting image skew. In document segmentation, an algorith...


Neural Network Based Recognition System Integrating Feature Extraction and Classification for English Handwritten

Handwriting recognition has been one of the active and challenging research areas in the field of image processing and pattern recognition. It has numerous applications, including reading aids for the blind, bank cheques, and conversion of any handwritten document into structural text form. Neural Network (NN) with its inherent learning ability offers promising solutions for handwritten characte...




Publication date: 2012